Python Recap

Workshop 1

Registering a GitHub account

Before we get started, we need to set a few things up. GitHub is a platform for software development and version control using Git, allowing developers to store and manage their code. Think of it as Google Docs, but for code: it will be very useful for collaborating on your group projects later in the term, and in your future as a data analyst.

1. Use this link to register for a GitHub account if you don’t already have one.
2. Once that’s done, create a new GitHub repository called “QM2”.
3. In this notebook, click “File” and then “Save a copy in GitHub”.

Voila! You now have a version of this notebook saved to your own GitHub account. You will need to repeat the third step (“Save a copy in GitHub”) for all the workshops! Now, on to Python.

Using Python

In this course, we’ll make extensive use of Python, a programming language used widely in scientific computing and on the web. We will be using Python as a way to manipulate, plot and analyse data. This isn’t a course about learning Python, it’s about working with data - but we’ll learn a little bit of programming along the way.

By now, you should have done the prerequisites for the module, and understand a bit about how Python is structured, what different commands do, and so on - this is a bit of a refresher to remind you of what we need at the beginning of term.

The particular flavour of Python we’re using is IPython, which, as we’ve seen, allows us to combine text, code, images, equations and figures in a Notebook. This is a cell, written in markdown - a way of writing nice text. Contrast this with a code cell, which executes a bit of Python:

print(2+2)

The Notebook format allows you to engage in what Donald Knuth called Literate Programming, described here by Ross Williams:

[…] Instead of writing code containing documentation, the literate programmer writes documentation containing code. No longer does the English commentary injected into a program have to be hidden in comment delimiters at the top of the file, or under procedure headings, or at the end of lines. Instead, it is wrenched into the daylight and made the main focus. The “program” then becomes primarily a document directed at humans, with the code being herded between “code delimiters” from where it can be extracted and shuffled out sideways to the language system by literate programming tools. Ross Williams

Libraries

We will work with a number of libraries, which provide additional functions and techniques to help us to carry out our tasks.

These include:

Pandas: we’ll use this a lot to slice and dice data

matplotlib: this is our basic graphing software, and we’ll also use it for mapping

nltk: The Natural Language Tool Kit will help us work with text

We aren’t doing all this to learn to program. We could spend a whole term learning how to use Python and never look at any data, maps, graphs, or visualisations. But we do need to understand a few basics to use Python for working with data. So let’s revisit a few concepts that you should have covered in your prerequisites.

Variables

Python can broadly be divided into verbs and nouns: things which do things, and things which are things. In Python, the verbs can be commands, functions, or methods. We won’t worry too much about the distinction here - suffice it to say, they are the parts of code which manipulate data, calculate values, or show things on the screen.

The simplest proper noun object in Python is the variable. Variables are given names and store information. This can be, for example, numeric, text, or boolean (true/false). These are all statements setting up variables:

n = 1

t = "hi"

b = True

Now let’s try this in code:

n = 1

t = "hi"

b = True

Note that each command is on a new line; other than that, the syntax of Python should be fairly clear. We’re setting these variables equal to the letters and numbers and phrases and booleans. What’s a boolean?

The value of this is that we now have values tied to these variables - so every time we want to use one of those values, we can refer to its variable by name:

'hi'

True

Because we’ve defined these variables in the early part of the notebook, we can use them later on.
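As a small sketch of reusing variables, the cell below checks what kind of value each variable holds using Python’s built-in type() function, and then builds a new value from an existing one. The variable name greeting is made up for illustration and is not part of the workshop.

```python
# The variables as defined earlier in the notebook
n = 1
t = "hi"
b = True

# type() reports what kind of value each variable holds
print(type(n))  # <class 'int'>
print(type(t))  # <class 'str'>
print(type(b))  # <class 'bool'>

# A hypothetical new variable built from an existing one
greeting = t + " there"
print(greeting)  # hi there
```

Note that t itself is unchanged by the last step; only greeting holds the longer text.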

Advanced: where do classes fit into this noun/verb picture of variables and commands?

Where is my data?

When we work in Excel and text editors, we’re used to seeing the data onscreen - and if we manipulate the data in some way (averaging or summing up), we see both the inputs and outputs on screen. The big difference in working with Python is that we don’t see our variables all of the time, or the effect we’re having on them. They’re there in the background, but it’s usually worth checking in on them from time to time, to see whether our processes are doing what we think they’re doing.

This is pretty easy to do - we can just type the variable name, or “print(variable name)”:

n = n+1
print(n)
print(t)
print(b)

2
hi
True
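When several values are printed one after another, it can be hard to tell which is which. One possible habit, sketched here with the same three variables, is to print a label next to each value. The last line uses the built-in repr() function, which shows text with its quotes, much like typing a bare variable name in a notebook cell does.

```python
# The variables as they stand at this point in the notebook
n = 2
t = "hi"
b = True

# Labelled output makes it easier to tell the values apart
print("n =", n)  # n = 2
print("t =", t)  # t = hi
print("b =", b)  # b = True

# repr() shows text with its quotes
print(repr(t))   # 'hi'
```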

Flow

Python, in common with all programming languages, executes commands in a sequence - we might refer to this as the “ineluctable march of the machines”, but it’s more commonly referred to as the flow of the code (we’ll use the word “code” a lot - it just means commands written in the programming language). In most cases, code just executes in the order it’s written. This is true within each cell (each block of text in the notebook), and it’s true when we execute the cells in order; that’s why we can refer back to the variables we defined earlier:

print(n)

If we make a change to one of these variables, say n:

n = 3

and then execute the “print(n)” cell above again, you’ll see that n is now 3. So if we go out of order, the obvious flow of the code is confused. For this reason, try to write your code so it executes in order, one cell at a time. At least for the moment, this will make it easier to follow the logic of what you’re doing to data.
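To make the effect of running cells out of order concrete, here is a minimal sketch that imitates several notebook cells inside a single block. The comments marking each “cell” are illustrative only; in the notebook these would be separate cells that you run by hand.

```python
# "Cell 1": define n
n = 1

# "Cell 2": add one to n
n = n + 1
print(n)  # 2

# "Cell 3": a later cell overwrites n
n = 3
print(n)  # 3

# Running "Cell 2" again at this point starts from 3, not from 1
n = n + 1
print(n)  # 4
```

The same line of code (n = n + 1) gives a different result depending on what ran before it, which is why running cells in order matters.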

Advanced: what happens to this flow when you write functions to automate common tasks?

Exercise - Setting up variables:

Create a new cell.
Create a variable “name” and assign your name to it.
Create a variable “Python” and assign a score out of 10 to how much you like Python.
Create a variable “prior” and, if you’ve used Python before, assign True to it; otherwise assign False.
Print these out to the screen

Downloading Data

Let’s fetch the data we will be using for this session. There are two ways in which you can get data into the Colab notebook. You can use the following code to upload a CSV or similar data file from your own computer.

from google.colab import files
uploaded = files.upload()

Or you can use the following cell to fetch the data directly from the QM2 server.

Let’s create a folder in which we can store all our data for this session:

!mkdir data

!mkdir ./data/wk1
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/data.csv -o ./data/wk1/data.csv
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/sample_group.csv -o ./data/wk1/sample_group.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   203  100   203    0     0   2872      0 --:--:-- --:--:-- --:--:--  3029
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   297  100   297    0     0   1844      0 --:--:-- --:--:-- --:--:--  1879
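The shell commands above can also be written in plain Python, which may be handy outside Colab. This is only a sketch, not the course’s method: the helper name fetch is made up for illustration, while os.makedirs and urllib.request.urlretrieve come from Python’s standard library. Unlike a plain mkdir, os.makedirs with exist_ok=True does not complain if the folder already exists, so the cell can be re-run safely.

```python
import os
import urllib.request


def fetch(url, dest):
    """Download url and save it at dest, creating the folder if needed."""
    folder = os.path.dirname(dest)
    if folder:
        os.makedirs(folder, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


# Create the nested folder in one step; no error if it already exists
os.makedirs("./data/wk1", exist_ok=True)

# Uncomment to download (requires an internet connection):
# fetch("https://s3.eu-west-2.amazonaws.com/qm2/wk1/data.csv", "./data/wk1/data.csv")
```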

Storing and importing data

Typically, data we look at won’t be just one number, or one bit of text. Python has a lot of different ways of dealing with a bunch of numbers: for example, a list of values is called a list:

listy = [1,2,3,6,9]
print(listy)

[1, 2, 3, 6, 9]

A set of values linked to an index (or key) is called a dictionary; for example:

dicty = {'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}
print(dicty)

{'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}

Notice that the list uses square brackets with values separated by commas, and the dict uses curly brackets with pairs separated by commas, and colons (:) to link a key (index or address) with a value.

(In Python 3.7 and later, dictionaries print in the order you entered them; in older versions of Python the order was not guaranteed.)
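As a short sketch of how to get individual values back out of these two structures: a list is indexed by position, counting from zero, and a dictionary is indexed by key. Both use square brackets for the lookup.

```python
listy = [1, 2, 3, 6, 9]
dicty = {'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}

# Lists are indexed by position, starting from 0
print(listy[0])   # 1 (the first element)
print(listy[-1])  # 9 (negative positions count from the end)

# Dictionaries are indexed by key
print(dicty['Bob'])  # 1.2

# len() gives the number of items in either structure
print(len(listy), len(dicty))  # 5 5
```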

Advanced: Print out 1) The third element of listy, and 2) The element of dicty relating to Giant

We’ll discuss different ways of organising data again soon, but for now we’ll look at dataframes - the way our data-friendly library Pandas works with data. We’ll be using Pandas a lot this term, so it’s good to get started with it early.

Let’s start by importing pandas. We’ll also import another library, but we’re not going to worry about that too much at the moment.

If you see a warning about ‘Building Font Cache’ don’t worry - this is normal.

import pandas

import matplotlib
%matplotlib inline

Let’s import a simple dataset and show it in pandas. We’ll use a pre-prepared “.csv” file, which needs to be at the path given in the code (here, the ./data/wk1/ folder we created above).

data = pandas.read_csv('./data/wk1/data.csv')
data.head()

	Name	First Appearance	Approx height	Gender	Law Enforcement
0	Bob	1.2	6.0	Male	False
1	Mike	1.2	5.5	Male	False
2	Coop	1.1	6.0	Male	True
3	Maddy	1.3	5.5	Female	False
4	Giant	2.1	7.5	Male	False

What we’ve done here is read a .csv file into a dataframe, the object pandas uses to work with data, and one that has lots of methods for slicing and dicing data, as we will see over the coming weeks. The head() method shows the first five rows of the data by default, so we can start to get a sense of what the data looks like and what types of values it contains.
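Beyond head(), a few other dataframe attributes are useful for a first look. The sketch below rebuilds the five rows shown above by hand, so that it runs without the .csv file; in the notebook you would use the data variable returned by read_csv instead.

```python
import pandas

# The five rows shown above, typed in by hand for illustration
data = pandas.DataFrame({
    'Name': ['Bob', 'Mike', 'Coop', 'Maddy', 'Giant'],
    'First Appearance': [1.2, 1.2, 1.1, 1.3, 2.1],
    'Approx height': [6.0, 5.5, 6.0, 5.5, 7.5],
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'Law Enforcement': [False, False, True, False, False],
})

# shape is a (rows, columns) pair
print(data.shape)  # (5, 5)

# head() takes an optional number of rows to show
print(data.head(2))

# dtypes lists the type of value stored in each column
print(data.dtypes)
```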

A common first step for exploring our data is to sort it. In Pandas, this can be done easily with the sort_values() method. We specify which column to sort by with the by argument, and whether to sort in ascending or descending order with the optional ascending argument (ascending order is the default). In the example below, we’re sorting in descending order of height:

data.sort_values(by='Approx height', ascending=False).head()

	Name	First Appearance	Approx height	Gender	Law Enforcement
4	Giant	2.1	7.5	Male	False
0	Bob	1.2	6.0	Male	False
2	Coop	1.1	6.0	Male	True
1	Mike	1.2	5.5	Male	False
3	Maddy	1.3	5.5	Female	False
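Several rows above share the same height. One way to control the order of such ties, sketched here, is to pass a list of columns to by and a matching list to ascending. The dataframe is rebuilt by hand so the example runs on its own, and the variable name ranked is made up for illustration. Note that sort_values() returns a new, sorted dataframe and leaves the original unchanged.

```python
import pandas

# The same five rows as in the workshop data, typed in by hand
data = pandas.DataFrame({
    'Name': ['Bob', 'Mike', 'Coop', 'Maddy', 'Giant'],
    'Approx height': [6.0, 5.5, 6.0, 5.5, 7.5],
})

# Tallest first; rows with equal height are ordered alphabetically by name
ranked = data.sort_values(by=['Approx height', 'Name'], ascending=[False, True])
print(ranked['Name'].tolist())  # ['Giant', 'Bob', 'Coop', 'Maddy', 'Mike']

# The original dataframe keeps its original order
print(data['Name'].tolist())    # ['Bob', 'Mike', 'Coop', 'Maddy', 'Giant']
```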

Supplementary: Kaggle exercises

If you’ve gotten this far, congratulations! To further hone your skills, try working your way through the five intro to programming notebooks on Kaggle. These cover a range of skills that we’ll be using throughout the term. Kaggle is a very useful resource for learning data science, so making an account may not be a bad idea!

Assessed Question

The URL below contains a dataset of the most streamed songs on Spotify in 2023: https://storage.googleapis.com/qm2/wk1/spotify-2023.csv

Download the dataset and save it in the ./data/wk1/ directory.
Load the dataset as a pandas dataframe, and inspect it. Two of the column names have accidentally been swapped around. Use common sense to figure out which ones these are before proceeding with your analysis.
Filter the dataset to only contain songs in the key of C sharp.
Sort the dataframe in descending order of streams.

QUESTION: which artist has the song with the highest number of streams?

# use this code cell to answer the question